Welcome to the capstone project! The following page contains the tasks and all of the necessary information for you to start the capstone project. Provide documented code and figures for the problems below in PDF format.
All data can be accessed at here
You can download the data by using Linux/Ubuntu command window as well. To download the data, copy the link of the file from the HTML file provided to you and just type wget link.
For example: wget https://bimsbstatic.mdc-berlin.de/akalin/AAkalin_DrugResponse/prepared/AAkalin_DrugResponse-prepared.tar.gz Typing this command will download the entire data folder.If you’re having any issues on wget command, you can get more information from the link here
Another example: You can download the data with wget and assign a specific name by typing wget -O CCLE_gex.zip https://bimsbstatic.mdc-berlin.de/akalin/AAkalin_DrugResponse/prepared/CCLE/all_intersect_genes/gex.tsv.gz. Typing this command will download the gene expression data from the CCLE with a specific name CCLE_gex.
Once you download the data, you need to unzip it.You can use gunzip() function to unzip the file with .gz extension. Additionally, you may also need to use R.utils::untar() function to untar the .tar files. All of the data is in .tsv format. There is no significantly difference between the .csv and .tsv format, so you can simply read .tsv by specifying sep = '\t', or by using read.delim() function.
For example: read.delim(R.utils::gunzip("data/CCLE/drug_response/TRAMETINIB.tsv.gz")) and you can assign the data you read to a variable.
From the code below, you can see how the PCA graph of gene expression is plotted by coloring it based on the drug.
#read the gene exp data
CCLE_gex <- read.delim("data/CCLE/all_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
# with the dimension of 1305 x 18356
#read the drug response data
drug_resp_CCLE <- read.delim("data/CCLE/drug_response/ALPELISIB.tsv") %>%
dplyr::bind_rows(read.delim("data/CCLE/drug_response/BUPARLISIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/CCLE/drug_response/CETUXIMAB.tsv")) %>%
dplyr::bind_rows(read.delim("data/CCLE/drug_response/ERLOTINIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/CCLE/drug_response/RIBOCICLIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/CCLE/drug_response/TRAMETINIB.tsv")) %>%
rename(drug = column_name, cell_lines = sample_id)
#with the dimension of 4190 x 5
#Remove near zero variation
nzv <- caret::preProcess(CCLE_gex,method="nzv",uniqueCut = 15)
CCLE_gex_pp <- predict(nzv,CCLE_gex)
#with the dimension of 3045 x 16023
#center scale and pca
preprocessParams <- caret::preProcess(CCLE_gex_pp,method=c("pca"))
CCLE_gex_pp <- predict(preprocessParams,CCLE_gex_pp)
#combine the drug response values
ccle_data <- dplyr::inner_join(drug_resp_CCLE, CCLE_gex_pp, by = "cell_lines") %>%
select(!data_type & !variable_type)
#with the dimension of 3045 x 18358
#Save the plots
ALPpcaplot <- ccle_data %>% filter(drug == "ALPELISIB") %>%
ggplot(aes(x=PC1,y=PC2))+
geom_point(aes(color =value))+
ggtitle("ALPELISIB")+
theme(legend.position = "right")+
scale_color_gradient(low = "orange", high = "Blue")
BUPpcaplot <- ccle_data %>% filter(drug == "BUPARLISIB") %>%
ggplot(aes(x=PC1,y=PC2))+
geom_point(aes(color =value))+
ggtitle("BUPARLISIB")+
theme(legend.position = "right")+
scale_color_gradient(low = "orange", high = "Blue")
CETpcaplot <- ccle_data %>% filter(drug == "CETUXIMAB") %>%
ggplot(aes(x=PC1,y=PC2))+
geom_point(aes(color =value))+
ggtitle("CETUXIMAB")+
theme(legend.position = "right")+
scale_color_gradient(low = "orange", high = "Blue")
ERLpcaplot <- ccle_data %>% filter(drug == "ERLOTINIB") %>%
ggplot(aes(x=PC1,y=PC2))+
geom_point(aes(color =value))+
ggtitle("ERLOTINIB")+
theme(legend.position = "right")+
scale_color_gradient(low = "orange", high = "Blue")
RIBpcaplot <- ccle_data %>% filter(drug == "RIBOCICLIB") %>%
ggplot(aes(x=PC1,y=PC2))+
geom_point(aes(color =value))+
ggtitle("RIBOCICLIB")+
theme(legend.position = "right")+
scale_color_gradient(low = "orange", high = "Blue")
TRApcaplot <- ccle_data %>% filter(drug == "TRAMETINIB") %>%
ggplot(aes(x=PC1,y=PC2))+
geom_point(aes(color =value))+
ggtitle("TRAMETINIB")+
theme(legend.position = "right")+
scale_color_gradient(low = "orange", high = "Blue")
cowplot::plot_grid(ALPpcaplot,BUPpcaplot,CETpcaplot,ERLpcaplot,RIBpcaplot,TRApcaplot)There are five main sections in the data file, which are CCLE, PDX_ALL, PDX_BRCA, GDSC_broad and GDSC_sanger. Each of these main folders contains three subfolders, as indicated in the README file as well. The all_intersect_genes folder contains all of the genomic data which are the predictive features that we will be using in our machine learning model. The drug_response folder contains the drug response data which is the continuous response variable that we will a training model to predict. There is six different drug response data present within the drug_response folder.
The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. The CCLE provides public access to genomic data, visualization, and analysis for over 1100 cancer cell lines. You can get more info on the CCLE from here Ghandi et al. (2019)
Genomics of Drug Sensitivity in Cancer (GDSC) is the largest public resource database for information on drug sensitivity in cancer cells and molecular markers of drug response. Data are freely available without restriction. GDSC currently contains drug sensitivity data for more than 75,000 experiments, describing the response to 138 anticancer drugs across more than 700 cancer cell lines. You can get more info on the GDSC from here Yang et al. (2012)
The data within The CCLE and The GDSC databases are kinds of CMA data. CMA stands for Cell Microarray. CMAs are the tumor samples that are grown in the lab environment. CMAsare poor representations of actual tumors despite using so widely in researches. The reason for this is the way CMAs are produced.
When a biopsy sample is taken from a patient, this sample is injected into mice in the lab. After the tumor develops again on mice, it is taken from the mouse and divided into slices and planted on plates, and cultured. Tumors are solid structures that consist of cells that are completely focused on division. The difference between CMA and tumor structures because only from the growth and division of cells located on the outside of the plate. The most important reason for this is that the cells in the internal structure of the tumor have limited access to oxygen. As a result, CMA is extremely useful in cancer research, but does not fully represent tumor cell structure.
PDX, on the other hand, represents the tumor structure better than CMAs. So don’t be surprised if you obtain different results on PDX data. PDX stands for Patient-Derived Xenograft. PDX is the re-creation of a tumor sample is taken by biopsy from the patient by injecting it into mice. We use the PDX data to obtain more meaningful results clinically. In Figure 1 below you can see the overall representation for using PDX to test drug efficacy. Jung, Seol, and Chang (2017) And the cell cultivation methodologies are represented in Figure 2. Ravi, Ramesh, and Pattabhi (2016)
The Gene expression measurement is usually achieved by quantifying levels of the gene product, which is often a protein. Two common techniques used for protein quantification include Western blotting and enzyme-linked immunosorbent assay or ELISA.Cancer can be described as a disease of altered gene expression. There are many proteins that are turned on or off (gene activation or gene silencing) that dramatically alter the overall activity of the cell. A gene that is not normally expressed in that cell can be switched on and expressed at high levels.
NOTE that the gene expression data was transformed. (values: log10 FPKM + 0.01)
CNV stands for Copy Number of Variation. A CNV is when the number of copies of a particular gene vary from one individual to the next. The extent to which copy number variation contributes to human disease is not yet known. It has long been recognized that some cancers are associated with elevated copy numbers of particular genes. For more information you can click the link here
A mutation is a change. It creates an abnormal protein. Or it may prevent a protein’s formation. An abnormal protein provides different information than a normal protein. This can cause cells to multiply uncontrollably and become cancerous.
For more information on variant calling comparison between the CCLE and the GDSC link is here
Drug response data indicates the drug efficacy in cell lines. They are widely used in cancer research to prioritize drugs to be tested in patients. You’ll notice the metric of the drug response is AUC. In pharmacology, the area under the plot of plasma concentration of a drug versus time after dosage (called “area under the curve” or AUC) gives insight into the extent of exposure to a drug and its clearance rate from the body. AUC combines information about drug efficacy and potency into a single value, and it was reported to be the most robust metric previously. To learn more on response metrics click this link
Your tasks include:
Does each omic dataset need to be normalized with the same method or different normalization methods for different omic data changes CV and PDX performance? What are the best normalization methods ?
Take the best or the combination of models you created for cell lines and predict drug response on the PDX data set. Plot metrics for each drug.
Which accuracy metric can you use for that purpose?
HINT : Again, you will probably need to jointly normalize data sets to be able to use them in training and validation.
IMPORTANT NOTE the PDX data set is for validation ONLY and it shouldn’t be used for training the model.
PDX validationresults are very different compared to cell lines or cross-validation on cell lines. This could be due to many different things, can you identify and remedy this problem? Show improvement on your previous PDX validation accuracy by employing your proposed solution.Below are all the necessary summary statistics about the data you will use to complete tasks. A summary of drug response data is given at the end of the document.
The dimension of the gene expression data is 1305 x 18357. There are also no missing values present within the dataset. Since the dimensions of the data was high, only the first fifthy columns were analyzed statistically.
#read the data
CCLE_gex <- read.delim("data/CCLE/all_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
#Check for the normality
hist(unlist(CCLE_gex[,2:50]),xlab="Gene Expression values",
main="The CCLE Gene Expression ",border="white",
col="cornflowerblue",bins = 50)#Plot the boxplots for the Gene Expression values of the first 50 columns
boxplot(CCLE_gex[,2:50],outline=FALSE, col="cornflowerblue",
xlab = "Gene Expression values",xaxt="n",
ylab = "Frequency",main = "The CCLE Gene Expression")#check the missing values of the data set
CCLE_gex[CCLE_gex == "NA"] <- NA
table(is.na(CCLE_gex)) #FALSE:23954580
#There are no missing valuesThere are no missing values present. The interactive heatmap of the CNV data was plotted. The dimension of the CNV data is 1745 x 18356 .
#read the data
CCLE_cnv <- read.delim("data/CCLE/all_intersect_genes/cnv.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="Blues"))(12)
pheatmap::pheatmap(CCLE_cnv[,2:50],cluster_rows=FALSE,show_rownames = FALSE,main="Copy Number Data from the CCLE",color = cols)#check the missing values of the data set
CCLE_cnv[CCLE_cnv == "NA"] <- NA
table(is.na(CCLE_cnv)) #FALSE:32031220
#There are no missing valuesBelow, the variant classification, top 10 mutated genes, variants per sample plots were plotted.
NOTE : TheCCLE mutation data contain missing values, so you have to deal with them in order to use.
CCLE_mutdelim <- read.delim(file = "data/CCLE/all_intersect_genes/mut.tsv",
sep = "\t", header = TRUE, fill = TRUE,
comment.char = "#") #read the mutation data
CCLE_mutdelim$Tumor_Seq_Allele2 <- CCLE_mutdelim$Tumor_Seq_Allele1
CCLE_mut <- read.maf(maf= CCLE_mutdelim) #to convert and read it as maf file## -Validating
## --Removed 6264 duplicated variants
## --Non MAF specific values in Variant_Classification column:
## 5UTR
## 5Flank
## Start_Codon_Del
## Stop_Codon_Ins
## 3UTR
## -Silent variants: 355334
## -Summarizing
## --Possible FLAGS among top ten genes:
## TTN
## MUC16
## OBSCN
## SYNE1
## USH2A
## -Processing clinical data
## --Missing clinical data
## -Finished in 27.0s elapsed (26.7s cpu)
#Plot the summary of mutation data
plotmafSummary(maf = CCLE_mut, rmOutlier = TRUE, addStat = 'median',
dashboard = TRUE,titvRaw = FALSE)#check the missing values of the data set
CCLE_mutdelim[CCLE_mutdelim == "NA"] <- NA
table(is.na(CCLE_mutdelim)) #FALSE: 43439347 , TRUE: 886853
#There are missing valuesA histogram and a bar plot were plotted for representing the frequency of the gene expression of the GDSC BROAD dataset. There are no missing values within the dataset. Check the distribution of data, do the data appear to have a normal distribution? The dimensions of the gene expression data is 613 x 18356.
#read the data
BROAD_gex <- read.delim("data/GDSC_broad/all_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
#Check for the normality
hist(unlist(BROAD_gex[,2:50]),xlab="Gene Expression values",
main="The GDSC BROAD Histogram ",border="white",
col="Green",bins = 50)#Plot the boxplots for the Gene Expression values of the first 50 columns
boxplot(BROAD_gex[,2:50],outline=FALSE, col="Green",
xlab = "Gene Expression values",xaxt="n",
ylab = "Frequency",main = "The GDSC BROAD BoxPlot")#check the missing values of the data set
BROAD_gex[BROAD_gex == "NA"] <- NA
table(is.na(BROAD_gex)) #FALSE:11252228
#There are no missing valuesThe interactive heatmap of the CNV data was plotted. There are no missing values present within the dataset.The dimensions of the CNV data is 891 x 18356.
#read the data
BROAD_cnv <- read.delim("data/GDSC_broad/all_intersect_genes/cnv.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="Greens"))(12)
pheatmap::pheatmap(BROAD_cnv[,2:50],cluster_rows=FALSE,
show_rownames = FALSE,
main="Copy Number Data from the GDSC BROAD",col = cols)#check the missing values of the data set
BROAD_cnv[BROAD_cnv == "NA"] <- NA
table(is.na(BROAD_cnv)) #FALSE:11252228
#There are no missing valuesThe frequency of each variant type and, top 10 mutated genes were plotted.
NOTE : There are missing values present within the data.
BROAD_mutdelim <- data.table::fread(
file = "data/GDSC_broad/all_intersect_genes/mut.tsv",sep = "\t",
header = TRUE, fill = TRUE) #read the mutation data
genesum <- maftools::getGeneSummary(MAF(BROAD_mutdelim)) %>% group_by(Hugo_Symbol) %>% select(Hugo_Symbol,AlteredSamples) %>%
arrange() %>% head(10)## -Processing clinical data
## --Missing clinical data
Variant_Type <- BROAD_mutdelim %>% group_by(Variant_Type) %>%
count() %>% rename(frequency = n)
var_plot <- ggplot(data=Variant_Type, aes(x=Variant_Type, y = frequency, fill=Variant_Type))+
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Variant Type") +
ylab("Frequency") +
xlab("Variant Type")
ggplot(data=genesum, aes(x=reorder(Hugo_Symbol, AlteredSamples), y = AlteredSamples, fill=Hugo_Symbol))+
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Top 10 Mutated Genes Based on Altered Samples") +
ylab("Number of Altered Samples") +
xlab("Gene Names")#check the missing values of the data set
BROAD_mutdelim[BROAD_mutdelim == "NA"] <- NA
table(is.na(BROAD_mutdelim)) #FALSE: 22905359 , TRUE: 24115732
#There are missing valuesA histogram and a boxplot were plotted to represent the expression values of the first fifty genes. There are no missing values present within the dataset. Check the distribution of the gene expression values. The dimensions of the gene expression data is 384 x 18356.
#read the data
SANGER_gex <- read.delim("data/GDSC_sanger/all_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
#Check for the normality
hist(unlist(SANGER_gex[,2:50]),xlab="Gene Expression values",
main="The GDSC SANGER Histogram ",border="white",
col="Orange",bins = 50)#Plot the boxplots for the Gene Expression values of the first 50 columns
boxplot(SANGER_gex[,2:50],outline=FALSE, col="Orange",
xlab = "Gene Expression values",xaxt="n",
ylab = "Frequency",main = "The GDSC SANGER BoxPlot")#check the missing values of the data set
SANGER_gex[SANGER_gex == "NA"] <- NA
table(is.na(SANGER_gex)) #FALSE:7048704
#There are no missing valuesThe interactive heatmap of the CNV data was plotted. There are no missing values present within the dataset. The dimensions of the CNV data is 891 x 18356.
#read the data
SANGER_cnv <- read.delim("data/GDSC_sanger/all_intersect_genes/cnv.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="Oranges"))(12)
pheatmap::pheatmap(SANGER_cnv[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
main="Copy Number Data from the GDSC SANGER", col=cols)#check the missing values of the data set
SANGER_cnv[SANGER_cnv == "NA"] <- NA
table(is.na(SANGER_cnv)) #FALSE:16355196
#There are no missing valuesThe frequency of each variant type and, top 10 mutated genes were plotted.
NOTE : There are missing values present in the mutation data.
SANGER_mutdelim <-data.table::fread(
file = "data/GDSC_sanger/all_intersect_genes/mut.tsv",sep = "\t",
header = TRUE, fill = TRUE)#read the mutation data
genesum <- maftools::getGeneSummary(MAF(SANGER_mutdelim)) %>%
group_by(Hugo_Symbol) %>% select(Hugo_Symbol,AlteredSamples) %>% arrange() %>%
head(10)## -Processing clinical data
## --Missing clinical data
Variant_Type <- SANGER_mutdelim %>% group_by(Variant_Type) %>% count() %>%
rename(frequency = n)
var_plot <- ggplot(data=Variant_Type, aes(x=Variant_Type,
y = frequency, fill=Variant_Type))+
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Variant Type") +
ylab("Frequency") +
xlab("Variant Type")
ggplot(data=genesum, aes(x=reorder(Hugo_Symbol, AlteredSamples), y = AlteredSamples, fill=Hugo_Symbol))+
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Top 10 Mutated Genes Based on Altered Samples") +
ylab("Number of Altered Samples") +
xlab("Gene Names")ggplotly(var_plot)#check the missing values of the data set
SANGER_mutdelim[SANGER_mutdelim == "NA"] <- NA
table(is.na(SANGER_mutdelim)) #FALSE: 24111028 , TRUE: 24115732
#There are missing valuesThere are no missing values present within the PDX_all data set. Check the distribution of gene expression. How does it differ from the distribution of gene expression values in other datasets? The dimension of the gene expression data is 399 x 18356.
#read the data
PDX_gex <- read.delim("data/PDX_all/all_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
#Check for the normality
hist(unlist(PDX_gex[,2:50]),xlab="The PDX Dataset",
main="The PDX Gene Expression ",border="white",
col="Purple")#Plot the boxplots for the Gene Expression values of the first 50 column
PDX_gex[,2:50] %>% boxplot(names = FALSE,outline=FALSE, col="Purple",
xlab = "Gene Expression values",ylim = c(0,400),
ylab = "Frequency",main = "The PDX Gene Expression")#check the missing values of the data set
PDX_gex[PDX_gex == "NA"] <- NA
table(is.na(PDX_gex)) #FALSE:23954580
#There are no missing valuesThere are no missing values present. The interactive heatmap of the CNV data was plotted. The dimension of the CNV data is 399 x 18356.
#read the data
PDX_cnv <- read.delim("data/PDX_all/all_intersect_genes/cnv.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="Purples"))(12)
pheatmap::pheatmap(PDX_cnv[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
main="Copy Number Data from the PDX",color = cols)#check the missing values of the data set
PDX_cnv[PDX_cnv == "NA"] <- NA
table(is.na(PDX_cnv)) #FALSE:16355196
#There are no missing valuesThere are missing values present in the mutation data and, variant plots could not be drawn in the PDX data set due to the same reasons why variant plots could not be drawn in other data sets. Instead,the histogram of the mutation frequency values were plotted.
PDX_mutdelim <- read.delim(file = "data/PDX_all/all_intersect_genes/mut.tsv",
sep = "\t", header = TRUE, fill = TRUE,
comment.char = "#") #read the mutation data
hist(unlist(PDX_mutdelim[,84]),xlab="Mutation Values",
main="The PDX Mutation Frequency",border="white",
col="Purple",bins = 10)#check the missing values of the data set
PDX_mutdelim[PDX_mutdelim == "NA"] <- NA
table(is.na(PDX_mutdelim)) #FALSE: 806164 , TRUE: 8432975
#There are missing valuesThere are no missing values present within the PDX_BRCA dataset. The dimension of the gene expression data is 41 x 18356.
#read the data
PDX_BRCA_gex <- read.delim("data/PDX_BRCA/all_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
#Check for the normality
hist(unlist(PDX_BRCA_gex[,2:50]),xlab="Gene Expression values",
main="The PDX BRCA Gene Expression ",border="white",
col="Blue", bins = 100)#Plot the boxplots for the Gene Expression values of the first 50 columns
boxplot(PDX_BRCA_gex[,2:50],outline=FALSE, col="Blue",
xlab = "Genes",xaxt="n",
ylab = "Frequency",main = "The PDX BRCA Gene Expression")#check the missing values of the data set
PDX_BRCA_gex[PDX_BRCA_gex == "NA"] <- NA
table(is.na(PDX_BRCA_gex)) #FALSE:752596
#There are no missing valuesThere are no missing values present. The interactive heatmap of the CNV data was plotted.The dimension of the CNV data is 41 x 18356.
#read the data
PDX_BRCA_cnv <- read.delim("data/PDX_BRCA/all_intersect_genes/cnv.tsv") %>%
rename(cell_lines = X) #Change the name of the cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="PuBu"))(12)
pheatmap::pheatmap(PDX_BRCA_cnv[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
main="Copy Number Data from the PDX BRCA", color = cols)#check the missing values of the data set
PDX_BRCA_cnv[PDX_BRCA_cnv == "NA"] <- NA
table(is.na(PDX_BRCA_cnv)) #FALSE:752596
#There are no missing valuesThere are missing values present in the mutation data and, variant plots could not be drawn in the PDX BRCA data set due to the same reasons why variant plots could not be drawn in other data sets. Instead,the histogram of the mutation frequency values were plotted.
PDX_BRCA_mutdelim <- read.delim(
file = "data/PDX_BRCA/all_intersect_genes/mut.tsv",sep = "\t", header = TRUE,
fill = TRUE,comment.char = "#") #read the mutation data
hist(unlist(PDX_BRCA_mutdelim[,84]),xlab="Mutation Values",
main="The PDX BRCA Mutation Frequency",border="white",
col= "Blue",bins = 10)If you ever look into the data folders, you may have noticed there is one more folder oncoKB_intersect_genes. This folder contains the panel sequencing data, which you will be using for part 2 tasks. Part 2 tasks are the same with Part 1 tasks, but you will be using panel sequencing data. In addition, you will be using the same drug response data that you used in the Part 1 tasks. There we would see if panel-sequencing works as well as multi-omics in terms of modeling drug response.
Your tasks include:
Does each panel sequencing dataset need to be normalized with the same method or different normalization methods for different panel sequencing data changes CV and PDX performance? What are the best normalization methods ?
Take the best or the combination of models you created for cell lines and predict drug response on the PDX data set. Plot metrics for each drug.
Which accuracy metric can you use for that purpose?
HINT : Again, you will probably need to jointly normalize data sets to be able to use them in training and validation.
IMPORTANT NOTE the PDX data set is for validation ONLY and it shouldn’t be used for training the model.
Below are all the necessary summary statistics about the data you will use to complete tasks. A summary of drug response data is given at the end of the document.
The dimension of the gene expression data is 1305 x 883. There are also no missing values present within the dataset. Since the dimensions of the data was high, only the first fifthy columns were analyzed statistically.
#read the data
CCLE_gex_pan <- read.delim("data/CCLE/oncoKB_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
#Check for the normality
hist(unlist(CCLE_gex_pan[,2:50]),xlab="Gene Expression values",
main="The CCLE Gene Expression ",border="white",
col="cornflowerblue",bins = 50)#Plot the boxplots for the Gene Expression values of the first 50 columns
boxplot(CCLE_gex_pan[,2:50],outline=FALSE, col="cornflowerblue",
xlab = "Gene Expression values",xaxt="n",
ylab = "Frequency",main = "The CCLE Gene Expression")#check the missing values of the data set
CCLE_gex_pan[CCLE_gex_pan == "NA"] <- NA
table(is.na(CCLE_gex_pan)) #FALSE:1152315
#There are no missing valuesThere are no missing values present. The interactive heatmap of the CNV data was plotted. The dimension of the CNV data is 1745 x 883 .
#read the data
CCLE_cnv_pan <- read.delim("data/CCLE/oncoKB_intersect_genes/cnv.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="Blues"))(12)
pheatmap::pheatmap(CCLE_cnv_pan[,2:50],cluster_rows=FALSE,show_rownames = FALSE,main="Copy Number Data from the CCLE",color = cols)#check the missing values of the data set
CCLE_cnv_pan[CCLE_cnv_pan == "NA"] <- NA
table(is.na(CCLE_cnv_pan)) #FALSE:1540835
#There are no missing valuesBelow, the variant classification, top 10 mutated genes, variants per sample plots were plotted.
NOTE : TheCCLE mutation data contain missing values, so you have to deal with them in order to use.
CCLE_mutdelim_pan <- read.delim(file = "data/CCLE/oncoKB_intersect_genes/mut.tsv",
sep = "\t", header = TRUE, fill = TRUE,
comment.char = "#") #read the mutation data
CCLE_mutdelim_pan$Tumor_Seq_Allele2 <- CCLE_mutdelim_pan$Tumor_Seq_Allele1
CCLE_mut_pan <- read.maf(maf= CCLE_mutdelim_pan) #to convert and read it as maf file## -Validating
## --Removed 522 duplicated variants
## --Non MAF specific values in Variant_Classification column:
## Stop_Codon_Ins
## 3UTR
## Start_Codon_Del
## 5Flank
## -Silent variants: 31994
## -Summarizing
## -Processing clinical data
## --Missing clinical data
## -Finished in 1.330s elapsed (1.170s cpu)
#Plot the summary of mutation data
plotmafSummary(maf = CCLE_mut_pan, rmOutlier = TRUE, addStat = 'median',
dashboard = TRUE,titvRaw = FALSE)#check the missing values of the data set
CCLE_mutdelim_pan[CCLE_mutdelim_pan == "NA"] <- NA
table(is.na(CCLE_mutdelim_pan)) #FALSE: 3719170 , TRUE: 77150 ##
## FALSE TRUE
## 3719170 77150
#There are missing valuesA histogram and a bar plot were plotted for representing the frequency of the gene expression of the GDSC BROAD dataset. There are no missing values within the dataset. Check the distribution of data, do the data appear to have a normal distribution? The dimensions of the gene expression data is 613 x 886.
#read the data
BROAD_gex_pan <- read.delim("data/GDSC_broad/oncoKB_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
#Check for the normality
hist(unlist(BROAD_gex_pan[,2:50]),xlab="Gene Expression values",
main="The GDSC BROAD Histogram ",border="white",
col="Green",bins = 50)#Plot the boxplots for the Gene Expression values of the first 50 columns
boxplot(BROAD_gex_pan[,2:50],outline=FALSE, col="Green",
xlab = "Gene Expression values",xaxt="n",
ylab = "Frequency",main = "The GDSC BROAD BoxPlot")#check the missing values of the data set
BROAD_gex_pan[BROAD_gex_pan == "NA"] <- NA
table(is.na(BROAD_gex_pan)) #FALSE:543118
#There are no missing valuesThe interactive heatmap of the CNV data was plotted. There are no missing values present within the dataset.The dimensions of the CNV data is 891 x 18356.
#read the data
BROAD_cnv_pan <- read.delim("data/GDSC_broad/oncoKB_intersect_genes/cnv.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="Greens"))(12)
pheatmap::pheatmap(BROAD_cnv_pan[,2:50],cluster_rows=FALSE,
show_rownames = FALSE,
main="Copy Number Data from the GDSC BROAD",col = cols)#check the missing values of the data set
BROAD_cnv_pan[BROAD_cnv_pan == "NA"] <- NA
table(is.na(BROAD_cnv_pan)) #FALSE:786753
#There are no missing valuesThe frequency of each variant type and, top 10 mutated genes were plotted.
NOTE : There are missing values present within the data.
BROAD_mutdelim_pan <- data.table::fread(
file = "data/GDSC_broad/oncoKB_intersect_genes/mut.tsv",sep = "\t",
header = TRUE, fill = TRUE) #read the mutation data
genesum <- maftools::getGeneSummary(MAF(BROAD_mutdelim_pan)) %>% group_by(Hugo_Symbol) %>% select(Hugo_Symbol,AlteredSamples) %>%
arrange() %>% head(10)## -Processing clinical data
## --Missing clinical data
Variant_Type <- BROAD_mutdelim_pan %>% group_by(Variant_Type) %>%
count() %>% rename(frequency = n)
var_plot <- ggplot(data=Variant_Type, aes(x=Variant_Type, y = frequency, fill=Variant_Type))+
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Variant Type") +
ylab("Frequency") +
xlab("Variant Type")
ggplot(data=genesum, aes(x=reorder(Hugo_Symbol, AlteredSamples), y = AlteredSamples, fill=Hugo_Symbol))+
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Top 10 Mutated Genes Based on Altered Samples") +
ylab("Number of Altered Samples") +
xlab("Gene Names") ggplotly(var_plot)#check the missing values of the data set
BROAD_mutdelim_pan[BROAD_mutdelim_pan == "NA"] <- NA
table(is.na(BROAD_mutdelim_pan)) #FALSE: 2143308 , TRUE: 2256126
#There are missing valuesA histogram and a boxplot were plotted to represent the expression values of the first fifty genes. There are no missing values present within the dataset. Check the distribution of the gene expression values. The dimensions of the gene expression data is 384 x 886.
#read the data
SANGER_gex_pan <- read.delim("data/GDSC_sanger/oncoKB_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
#Check for the normality
hist(unlist(SANGER_gex_pan[,2:50]),xlab="Gene Expression values",
main="The GDSC SANGER Histogram ",border="white",
col="Orange",bins = 50)#Plot the boxplots for the Gene Expression values of the first 50 columns
boxplot(SANGER_gex_pan[,2:50],outline=FALSE, col="Orange",
xlab = "Gene Expression values",xaxt="n",
ylab = "Frequency",main = "The GDSC SANGER BoxPlot")#check the missing values of the data set
SANGER_gex_pan[SANGER_gex_pan == "NA"] <- NA
table(is.na(SANGER_gex_pan)) #FALSE:340224
#There are no missing valuesThe interactive heatmap of the CNV data was plotted. There are no missing values present within the dataset. The dimensions of the CNV data is 891 x 883.
#read the data
SANGER_cnv_pan <- read.delim("data/GDSC_sanger/oncoKB_intersect_genes/cnv.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="Oranges"))(12)
pheatmap::pheatmap(SANGER_cnv_pan[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
main="Copy Number Data from the GDSC SANGER", col=cols)#check the missing values of the data set
SANGER_cnv_pan[SANGER_cnv_pan == "NA"] <- NA
table(is.na(SANGER_cnv_pan)) #FALSE:786753
#There are no missing valuesThe frequency of each variant type and, top 10 mutated genes were plotted.
NOTE : There are missing values present in the mutation data.
SANGER_mutdelim_pan <-data.table::fread(
file = "data/GDSC_sanger/oncoKB_intersect_genes/mut.tsv",sep = "\t",
header = TRUE, fill = TRUE)#read the mutation data
genesum <- maftools::getGeneSummary(MAF(SANGER_mutdelim_pan)) %>%
group_by(Hugo_Symbol) %>% select(Hugo_Symbol,AlteredSamples) %>% arrange() %>%
head(10)## -Processing clinical data
## --Missing clinical data
Variant_Type <- SANGER_mutdelim_pan %>% group_by(Variant_Type) %>% count() %>%
rename(frequency = n)
var_plot <- ggplot(data=Variant_Type, aes(x=Variant_Type,
y = frequency, fill=Variant_Type))+
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Variant Type") +
ylab("Frequency") +
xlab("Variant Type")
ggplot(data=genesum, aes(x=reorder(Hugo_Symbol, AlteredSamples), y = AlteredSamples, fill=Hugo_Symbol))+
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Top 10 Mutated Genes Based on Altered Samples") +
ylab("Number of Altered Samples") +
xlab("Gene Names") ggplotly(var_plot)#check the missing values of the data set
SANGER_mutdelim_pan[SANGER_mutdelim_pan == "NA"] <- NA
table(is.na(SANGER_mutdelim_pan)) #FALSE: 2143308 , TRUE: 2256126
#There are missing valuesThere are no missing values present within the PDX_all data set. Check the distribution of gene expression. How does it differ from the distribution of gene expression values in other datasets? The dimension of the gene expression data is 399 x 883.
#read the data
PDX_gex_pan <- read.delim("data/PDX_all/oncoKB_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
#Check for the normality
hist(unlist(PDX_gex_pan[,2:50]),xlab="The PDX Dataset",
main="The PDX Gene Expression ",border="white",
col="Purple")#Plot the boxplots for the Gene Expression values of the first 50 column
PDX_gex_pan[,2:50] %>% boxplot(names = FALSE,outline=FALSE, col="Purple",
xlab = "Gene Expression values",ylim = c(0,400),
ylab = "Frequency",main = "The PDX Gene Expression")#check the missing values of the data set
PDX_gex_pan[PDX_gex_pan == "NA"] <- NA
table(is.na(PDX_gex_pan)) #FALSE:352317
#There are no missing valuesThere are no missing values present. The interactive heatmap of the CNV data was plotted. The dimension of the CNV data is 399 x 883.
#read the data
PDX_cnv_pan <- read.delim("data/PDX_all/oncoKB_intersect_genes/cnv.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="Purples"))(12)
pheatmap::pheatmap(PDX_cnv_pan[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
main="Copy Number Data from the PDX",color = cols)#check the missing values of the data set
PDX_cnv_pan[PDX_cnv_pan == "NA"] <- NA
table(is.na(PDX_cnv_pan)) #FALSE:352317
#There are no missing valuesThere are missing values present in the mutation data and, variant plots could not be drawn in the PDX data set due to the same reasons why variant plots could not be drawn in other data sets. Instead,the histogram of the mutation frequency values were plotted.
PDX_mutdelim_pan <- read.delim(file = "data/PDX_all/oncoKB_intersect_genes/mut.tsv",
sep = "\t", header = TRUE, fill = TRUE,
comment.char = "#") #read the mutation data
hist(unlist(PDX_mutdelim_pan[,84]),xlab="Mutation Values",
main="The PDX Mutation Frequency",border="white",
col="Purple",bins = 10)#check the missing values of the data set
PDX_mutdelim_pan[PDX_mutdelim_pan == "NA"] <- NA
table(is.na(PDX_mutdelim_pan)) #FALSE: 75781 , TRUE: 787536
#There are missing valuesThere are no missing values present within the PDX_BRCA dataset. The dimension of the gene expression data is 41 x 18356.
#read the data
PDX_BRCA_gex_pan <- read.delim("data/PDX_BRCA/oncoKB_intersect_genes/gex.tsv") %>%
rename(cell_lines = X) #Change the name of cell lines column to cell_lines
#Check for the normality
hist(unlist(PDX_BRCA_gex_pan[,2:50]),xlab="Gene Expression values",
main="The PDX BRCA Gene Expression ",border="white",
col="Blue", bins = 100)#Plot the boxplots for the Gene Expression values of the first 50 columns
boxplot(PDX_BRCA_gex_pan[,2:50],outline=FALSE, col="Blue",
xlab = "Genes",xaxt="n",
ylab = "Frequency",main = "The PDX BRCA Gene Expression")#check the missing values of the data set
PDX_BRCA_gex_pan[PDX_BRCA_gex_pan == "NA"] <- NA
table(is.na(PDX_BRCA_gex_pan)) #FALSE:752596
#There are no missing valuesThere are no missing values present. The interactive heatmap of the CNV data was plotted.The dimension of the CNV data is 41 x 18356.
#read the data
PDX_BRCA_cnv_pan <- read.delim("data/PDX_BRCA/oncoKB_intersect_genes/cnv.tsv") %>%
rename(cell_lines = X) #Change the name of the cell lines column to cell_lines
cols <- colorRampPalette(brewer.pal(6,name="PuBu"))(12)
pheatmap::pheatmap(PDX_BRCA_cnv_pan[,2:50],cluster_rows=FALSE,show_rownames = FALSE,
main="Copy Number Data from the PDX BRCA", color = cols)#check the missing values of the data set
PDX_BRCA_cnv_pan[PDX_BRCA_cnv_pan == "NA"] <- NA
table(is.na(PDX_BRCA_cnv_pan)) #FALSE:752596
#There are no missing valuesThere are missing values present in the mutation data and, variant plots could not be drawn in the PDX BRCA data set due to the same reasons why variant plots could not be drawn in other data sets. Instead,the histogram of the mutation frequency values were plotted.
PDX_BRCA_mut_pan <- read.delim(
file = "data/PDX_BRCA/oncoKB_intersect_genes/mut.tsv",sep = "\t", header = TRUE,
fill = TRUE,comment.char = "#") #read the mutation data
hist(unlist(PDX_BRCA_mut_pan[,84]),xlab="Mutation Values",
main="The PDX BRCA Mutation Frequency",border="white",
col= "Blue",bins = 10)As mentioned in the introduction section, there are different drug response data present for six different drugs within the drug response folder. These are the six common drugs that are used for the treatment of several cancer types.
The distribution of each drug was plotted. Check the difference in drug response distribution for each drug.
NOTE : TheCCLE drug response data does not contain any missing values.
# Read and Combine all of the response data
drug_resp_CCLE <- read.delim("data/CCLE/drug_response/ALPELISIB.tsv") %>%
dplyr::bind_rows(read.delim("data/CCLE/drug_response/BUPARLISIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/CCLE/drug_response/CETUXIMAB.tsv")) %>%
dplyr::bind_rows(read.delim("data/CCLE/drug_response/ERLOTINIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/CCLE/drug_response/RIBOCICLIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/CCLE/drug_response/TRAMETINIB.tsv")) %>%
rename(drug = column_name)
table(drug_resp_CCLE$drug) #Print how many response data present ##
## ALPELISIB BUPARLISIB CETUXIMAB ERLOTINIB RIBOCICLIB TRAMETINIB
## 785 736 855 839 45 930
drug_plot <- drug_resp_CCLE %>% group_by(drug) %>%
ggplot(aes(x = value,fill=drug)) +
geom_histogram(color="black",bins = 50) +
facet_wrap(drug ~ ., scales = 'free_x')+
scale_y_sqrt()+
ggtitle("Distribution of the Response Value Among the Drugs :the CCLE")+
xlab("Response Values")+
ylab("Frequency")
ggplotly(drug_plot)drug_resp_CCLE[drug_resp_CCLE == "NA"] <- NA
#Introduce missing values
table(drug_resp_CCLE$drug, is.na(drug_resp_CCLE$value))
# check if there are missing valuesHistograms were plotted to represent the distributions of the response values for each drug. Check the plots. Does it look like the distribution of the CCLE’s drug response values?
NOTE : There are no missing values present within the GDSC BROAD dataset.
# Read and Combine all of the response data
drug_resp_BROAD <- read.delim("data/GDSC_broad/drug_response/ALPELISIB.tsv") %>%
dplyr::bind_rows(read.delim("data/GDSC_broad/drug_response/BUPARLISIB.tsv"))%>%
dplyr::bind_rows(read.delim("data/GDSC_broad/drug_response/CETUXIMAB.tsv"))%>%
dplyr::bind_rows(read.delim("data/GDSC_broad/drug_response/ERLOTINIB.tsv"))%>%
dplyr::bind_rows(read.delim("data/GDSC_broad/drug_response/RIBOCICLIB.tsv"))%>%
dplyr::bind_rows(read.delim("data/GDSC_broad/drug_response/TRAMETINIB.tsv"))%>%
rename(drug = column_name)
table(drug_resp_BROAD$drug) #Print how many response data present ##
## ALPELISIB BUPARLISIB CETUXIMAB ERLOTINIB RIBOCICLIB TRAMETINIB
## 761 712 806 1072 43 1550
drug_plot <- drug_resp_BROAD %>% group_by(drug) %>%
ggplot(aes(x = value , fill=drug)) +
geom_histogram(color="black",bins = 50) +
facet_wrap(drug ~ ., scales = 'free_x')+
scale_y_sqrt()+
ggtitle("Distribution of the Response Values: the GDSC BROAD")+
xlab("Response Values")+
ylab("Frequency")
ggplotly(drug_plot)drug_resp_BROAD[drug_resp_BROAD == "NA"] <- NA
#Introduce missing values
table(drug_resp_BROAD$drug, is.na(drug_resp_BROAD$value))
# check if there are missing valuesHistograms were plotted to represent the distributions of the response values for each drug. Check the plots. Does it look like the distribution of the CCLE’s and BROAD’S drug response values? In addition, the number of samples for each drug was printed in the table.
NOTE : There are no missing values present within the GDSC SANGER dataset.
# Read and Combine all of the response data
drug_resp_sanger <- read.delim("data/GDSC_sanger/drug_response/ALPELISIB.tsv")%>%
dplyr::bind_rows(read.delim("data/GDSC_sanger/drug_response/BUPARLISIB.tsv"))%>%
dplyr::bind_rows(read.delim("data/GDSC_sanger/drug_response/CETUXIMAB.tsv"))%>%
dplyr::bind_rows(read.delim("data/GDSC_sanger/drug_response/ERLOTINIB.tsv"))%>%
dplyr::bind_rows(read.delim("data/GDSC_sanger/drug_response/RIBOCICLIB.tsv"))%>%
dplyr::bind_rows(read.delim("data/GDSC_sanger/drug_response/TRAMETINIB.tsv"))%>%
rename(drug = column_name)
table(drug_resp_sanger$drug) #Print how many response data present ##
## ALPELISIB BUPARLISIB CETUXIMAB ERLOTINIB RIBOCICLIB TRAMETINIB
## 761 712 806 1072 43 1550
drug_plot <- drug_resp_sanger %>% group_by(drug) %>%
ggplot(aes(x = value , fill=drug)) +
geom_histogram(color="black",bins = 50) +
facet_wrap(drug ~ ., scales = 'free_x')+
scale_y_sqrt()+
ggtitle("Distribution of the Response Value Among the Drugs :GDSC SANGER")+
xlab("Response Values")+
ylab("Frequency")
ggplotly(drug_plot)drug_resp_sanger[drug_resp_sanger == "NA"] <- NA
#Introduce missing values
table(drug_resp_sanger$drug, is.na(drug_resp_sanger$value))
# check if there are missing valuesHistograms were plotted to represent the distributions of the response values for each drug. Check the plots. Does it look like the distribution of the CCLE’s, SANGER’s and BROAD’S drug response values? In addition, the number of samples for each drug was printed in the table.
# Read and Combine all of the response data
drug_resp_PDX <- read.delim("data/PDX_all/drug_response/ALPELISIB.tsv") %>%
dplyr::bind_rows(read.delim("data/PDX_all/drug_response/BUPARLISIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/PDX_all/drug_response/CETUXIMAB.tsv")) %>%
dplyr::bind_rows(read.delim("data/PDX_all/drug_response/ERLOTINIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/PDX_all/drug_response/RIBOCICLIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/PDX_all/drug_response/TRAMETINIB.tsv")) %>%
rename(drug = column_name)
table(drug_resp_PDX$drug) #Print how many response data present ##
## ALPELISIB BUPARLISIB CETUXIMAB ERLOTINIB RIBOCICLIB TRAMETINIB
## 281 281 281 281 281 281
drug_plot <- drug_resp_PDX %>% group_by(drug) %>%
ggplot(aes(x = value , fill=drug)) +
geom_histogram(color="black",bins = 50) +
facet_wrap(drug ~ ., scales = 'free_x')+
scale_y_sqrt()+
ggtitle("Distribution of the Response Value Among the Drugs :the PDX")+
xlab("Response Values")+
ylab("Frequency")
ggplotly(drug_plot)drug_resp_PDX[drug_resp_PDX == "NA"] <- NA
#Introduce missing values
table(drug_resp_PDX$drug, is.na(drug_resp_PDX$value)) ##
## FALSE TRUE
## ALPELISIB 212 69
## BUPARLISIB 224 57
## CETUXIMAB 72 209
## ERLOTINIB 29 252
## RIBOCICLIB 246 35
## TRAMETINIB 39 242
# check if there are missing valuesHistograms were plotted to represent the distributions of the response values for each drug. Check the plots. Does it look like the distribution of the CCLE’s, SANGER’s and BROAD’S drug response values? In addition, the number of samples for each drug was printed in the table.
# Read and Combine all of the response data
drug_resp_PDX_BRCA <- read.delim("data/PDX_BRCA/drug_response/ALPELISIB.tsv") %>%
dplyr::bind_rows(read.delim("data/PDX_BRCA/drug_response/BUPARLISIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/PDX_BRCA/drug_response/CETUXIMAB.tsv")) %>%
dplyr::bind_rows(read.delim("data/PDX_BRCA/drug_response/ERLOTINIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/PDX_BRCA/drug_response/RIBOCICLIB.tsv")) %>%
dplyr::bind_rows(read.delim("data/PDX_BRCA/drug_response/TRAMETINIB.tsv")) %>%
rename(drug = column_name)
table(drug_resp_PDX_BRCA$drug) #Print how many response data present ##
## ALPELISIB BUPARLISIB CETUXIMAB ERLOTINIB RIBOCICLIB TRAMETINIB
## 281 281 281 281 281 281
drug_plot <- drug_resp_PDX_BRCA %>% group_by(drug) %>%
ggplot(aes(x = value , fill=drug)) +
geom_histogram(color="black",bins = 50) +
facet_wrap(drug ~ ., scales = 'free_x')+
scale_y_sqrt()+
ggtitle("Distribution of the Response Value Among the Drugs :the PDX BRCA")+
xlab("Response Values")+
ylab("Frequency")
ggplotly(drug_plot)drug_resp_PDX_BRCA[drug_resp_PDX_BRCA == "NA"] <- NA
#Introduce missing values
table(drug_resp_PDX_BRCA$drug, is.na(drug_resp_PDX_BRCA$value)) ##
## FALSE TRUE
## ALPELISIB 212 69
## BUPARLISIB 224 57
## CETUXIMAB 72 209
## ERLOTINIB 29 252
## RIBOCICLIB 246 35
## TRAMETINIB 39 242
# check if there are missing values